Red Wine Analysis by Awad Bin-Jawed

This report explores a dataset of red wines containing 1599 observations, each recording quality, pH and other chemical properties. It explores which variables contribute most to the quality of red wine.

Univariate Plots Section

## [1] 1599   12

There are 1599 observations and 12 features in the RedWine data.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

All features except quality are numeric. Quality is an integer score for the red wine, ranging from 3 to 8, and can be treated as a categorical variable.
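As a minimal sketch of treating quality as a categorical variable, the snippet below converts the first ten quality scores from the str() output into an ordered factor. The variable names quality_f and quality_counts are illustrative and do not come from the original analysis.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Treat the score as an ordered categorical variable covering the full
# 3-8 range, so levels absent from this small sample are still represented.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Counts per level; levels with no observations show a count of zero.
quality_counts <- table(quality_f)
print(quality_counts)
```

Fixing the levels to 3:8 keeps later plots and tables aligned across quality levels even when a subset of the data is missing some of them.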

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Our dataset consists of 12 variables with 1599 observations. It records the quality of red wines, which depends on factors such as pH, alcohol content, density and other chemical properties. As the summary shows, pH ranges from 2.74 to 4.01 and alcohol from 8.4 to 14.9. We can explore these features for patterns that identify which of them affect the quality of red wine.

The alcohol distribution is right-skewed, with an outlier near 15. It will be interesting to explore the distribution of alcohol across wine quality levels.

The distribution of sulphates is almost normal if we ignore the outliers beyond 1.2. Sulphates turn out to be the second most important variable for wine quality.

Total sulfur dioxide has a few far outliers at around 300, and the overall distribution is right-skewed. We transformed this feature by removing the outliers; the right plot shows the distribution afterwards. There are a few smaller peaks at around 80. It will be interesting to analyse this distribution for each quality level.
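The report does not say which rule was used to remove outliers. One common choice, assumed here purely for illustration, is Tukey's upper fence (third quartile plus 1.5 times the interquartile range). The sketch uses the quartiles reported by summary() and the first ten total.sulfur.dioxide values, with the reported maximum appended so that the sample contains one extreme value.

```r
# First ten total.sulfur.dioxide values from the str() output, plus the
# reported maximum (289) appended as an example of an extreme value.
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 289)

# Quartiles reported by summary() for the full data set.
q1 <- 22
q3 <- 62

# Tukey's upper fence: an assumed rule, not necessarily the one used here.
upper_fence <- q3 + 1.5 * (q3 - q1)

# Keep only values at or below the fence.
tsd_trimmed <- tsd[tsd <= upper_fence]
print(upper_fence)
print(tsd_trimmed)
```

With these quartiles the fence works out to 122, so values near 300 are dropped while the bulk of the distribution is kept.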

pH values are approximately normally distributed. A few wines have pH above 3.7. It will be interesting to see how pH relates to quality, that is, whether high-quality wines are more or less acidic.

fixed.acidity looks roughly normally distributed, with most wines between 6 and 9. A few wines have fixed.acidity around 16, well above the mean. There are a few small peaks at around 11, which could be a different group. It would be interesting to see how fixed.acidity behaves across wine quality levels.

volatile.acidity also looks roughly normal, though slightly right-skewed. There are a few outliers at around 1.5. There are three peaks near the center, which could reflect different distributions for different quality levels.

The distribution of citric acid is right-skewed and multimodal, with an outlier at 1. It would be useful to split this data into groups of 0-0.20, 0.21-0.60 and above 0.6.

Residual sugar is right-skewed, with most values in the range 0-4. There are a few outliers above a sugar level of 8.

The chloride distribution is also right-skewed, with most values in the range 0.05-0.15. There are a few outliers above a chloride level of 0.2. It would be worth exploring whether these outliers come from a particular quality of wine with higher levels of sugar and chlorides.

Density is approximately normally distributed with nearly symmetric tails, centered around 0.997.

Most observations are of quality 5, followed by 6 and 7. There are few samples of quality 3, 4 and 8.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wine observations in the dataset with 12 features. All features contain numeric data except quality, which is a categorical variable with levels 3 to 8.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality of the red wines. It would be interesting to know what factors influence quality; these factors could then be used to build a predictive model of red wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The other features of interest are alcohol, sulphates, pH, fixed acidity, volatile acidity, citric acid, free sulfur dioxide and density. It would be interesting to explore the relationship of these features with wine quality.

Did you create any new variables from existing variables in the dataset?

Yes, I created partitions for the citric acid feature. Because its distribution is multimodal with several peaks, the values are grouped into partitions of 0-0.20, 0.2-0.6 and above 0.6.
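A partition like this can be sketched with base R's cut(), using the cut points 0.2 and 0.6 given in the univariate section. The values are the first ten citric.acid entries from the str() output; the name citric_group and the labels are illustrative assumptions, not the names used in the original analysis.

```r
# First ten citric.acid values from the str() output.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Bin into three groups: [0, 0.2], (0.2, 0.6] and (0.6, Inf).
# include.lowest = TRUE keeps wines with exactly 0 citric acid in the first bin.
citric_group <- cut(
  citric,
  breaks = c(0, 0.2, 0.6, Inf),
  labels = c("0-0.2", "0.2-0.6", "above 0.6"),
  include.lowest = TRUE
)

print(table(citric_group))
```

The resulting factor can then be used as a faceting or colouring variable in the multivariate plots.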

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

We performed principal component analysis (PCA) to find associations among features. Using a biplot, we observed that total.sulfur.dioxide and free.sulfur.dioxide are highly associated. We also fitted a random forest to find the variables most important for wine quality: alcohol is the most important feature, followed by sulphates and then the other features.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity           1.00000000     -0.256130895  0.67170343    0.114776724
## volatile.acidity       -0.25613089      1.000000000 -0.55249568    0.001917882
## citric.acid             0.67170343     -0.552495685  1.00000000    0.143577162
## residual.sugar          0.11477672      0.001917882  0.14357716    1.000000000
## chlorides               0.09370519      0.061297772  0.20382291    0.055609535
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813    0.187048995
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302    0.203027882
## density                 0.66804729      0.022026232  0.36494718    0.355283371
## pH                     -0.68297819      0.234937294 -0.54190414   -0.085652422
## sulphates               0.18300566     -0.260986685  0.31277004    0.005527121
## alcohol                -0.06166827     -0.202288027  0.10990325    0.042075437
## quality                 0.12405165     -0.390557780  0.22637251    0.013731637
##                         chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity         0.093705186        -0.153794193          -0.11318144
## volatile.acidity      0.061297772        -0.010503827           0.07647000
## citric.acid           0.203822914        -0.060978129           0.03553302
## residual.sugar        0.055609535         0.187048995           0.20302788
## chlorides             1.000000000         0.005562147           0.04740047
## free.sulfur.dioxide   0.005562147         1.000000000           0.66766645
## total.sulfur.dioxide  0.047400468         0.667666450           1.00000000
## density               0.200632327        -0.021945831           0.07126948
## pH                   -0.265026131         0.070377499          -0.06649456
## sulphates             0.371260481         0.051657572           0.04294684
## alcohol              -0.221140545        -0.069408354          -0.20565394
## quality              -0.128906560        -0.050656057          -0.18510029
##                          density          pH    sulphates     alcohol
## fixed.acidity         0.66804729 -0.68297819  0.183005664 -0.06166827
## volatile.acidity      0.02202623  0.23493729 -0.260986685 -0.20228803
## citric.acid           0.36494718 -0.54190414  0.312770044  0.10990325
## residual.sugar        0.35528337 -0.08565242  0.005527121  0.04207544
## chlorides             0.20063233 -0.26502613  0.371260481 -0.22114054
## free.sulfur.dioxide  -0.02194583  0.07037750  0.051657572 -0.06940835
## total.sulfur.dioxide  0.07126948 -0.06649456  0.042946836 -0.20565394
## density               1.00000000 -0.34169933  0.148506412 -0.49617977
## pH                   -0.34169933  1.00000000 -0.196647602  0.20563251
## sulphates             0.14850641 -0.19664760  1.000000000  0.09359475
## alcohol              -0.49617977  0.20563251  0.093594750  1.00000000
## quality              -0.17491923 -0.05773139  0.251397079  0.47616632
##                          quality
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632
## quality               1.00000000

The correlation plot is ordered by hierarchical clustering, which reveals that chlorides, sulphates, density, fixed.acidity and citric acid are positively correlated with each other and are grouped together in the correlation matrix plot above. This pattern was also revealed by the biplot from our PCA. These variables are strongly associated with each other and can be investigated further to see how they affect wine quality. pH has a strong negative correlation with fixed.acidity and citric acid, and a weaker negative correlation with chlorides, sulphates and density. pH is positively correlated with alcohol and volatile.acidity.
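The report does not show how the ordering was produced. One way to get a comparable hierarchical-clustering order in base R, sketched here as an assumption, is to cluster on 1 minus the correlation and reorder the matrix. The data frame wine10 is a hypothetical name holding only the first ten rows shown in the str() output, so its coefficients will differ from the full-data matrix above.

```r
# First ten rows of six variables, copied from the str() output.
wine10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation matrix.
corr <- cor(wine10)

# Use 1 - correlation as a dissimilarity so that positively correlated
# variables end up next to each other after clustering.
ord <- hclust(as.dist(1 - corr))$order
corr_ordered <- corr[ord, ord]

print(round(corr_ordered, 2))
```

Reordering rows and columns by the same permutation keeps the matrix symmetric while making blocks of related variables easier to see.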

The plot above visualises the relationship between each pair of variables, with the correlation coefficients in the upper-right triangle. fixed.acidity and citric.acid have a strong positive correlation with each other. These two variables are not very important in determining wine quality. However, looking closely at volatile.acidity and citric.acid, volatile.acidity is the important variable and citric.acid is strongly negatively correlated with it. It would be a good idea to explore citric.acid, or transform it, to see how it affects overall wine quality. Density is also strongly positively correlated with fixed.acidity. We are also interested in how pH relates to the other variables: it is negatively correlated with fixed.acidity and citric acid, which is expected because a lower pH means a more acidic solution. Free and total sulfur dioxide are strongly positively correlated with each other.

The biplot of the first two principal components shows that total.sulfur.dioxide and free.sulfur.dioxide are highly associated. Similarly, citric.acid, fixed.acidity and sulphates are associated with each other, and sugar, density and chlorides are also associated and correlated. This is very useful for analysing the relationships among these features.
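A PCA of this kind can be sketched with base R's prcomp(). The snippet below is an illustration only: it uses the first ten rows from the str() output for five of the variables discussed above, and the names wine10 and var_explained are assumptions. With so few rows, the loadings will not match the full-data biplot.

```r
# First ten rows of five variables, copied from the str() output.
wine10 <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Standardise the variables so that units do not dominate the components.
pca <- prcomp(wine10, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings on the first two components; variables with similar loadings
# point in similar directions on a biplot.
print(round(pca$rotation[, 1:2], 2))

# biplot(pca)  # uncomment to draw the biplot
```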

We applied a random forest to identify the important variables in the data set. Alcohol is the most important variable in determining wine quality, followed by sulphates, total.sulfur.dioxide, volatile.acidity and density. Chlorides, fixed.acidity and pH are of almost equal importance. The other variables do not seem to have a big impact on wine quality. This gives an idea of which variables are worth exploring further.
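The report does not include the model code, so the following is only a sketch of how variable importance could be ranked with the randomForest package. It runs on simulated toy data, not the wine data: the ranges are taken from the summary above, but the relationship between quality and the predictors is invented for illustration. The helper rank_importance is hypothetical, and the code skips the fit if randomForest is not installed.

```r
# Simulated toy data (NOT the wine data). Ranges follow the summary above,
# but the link between quality and the predictors is made up for illustration.
set.seed(1)
n <- 300
toy <- data.frame(
  alcohol   = runif(n, 8.4, 14.9),
  sulphates = runif(n, 0.33, 2.0),
  pH        = runif(n, 2.74, 4.01)
)
toy$quality <- 3 + 0.6 * (toy$alcohol - 8.4) + 0.5 * toy$sulphates +
  rnorm(n, sd = 0.5)

# Hypothetical helper: fit a regression forest and return predictors sorted
# by node-purity importance. Returns NULL if the package is unavailable.
rank_importance <- function(df, response = "quality") {
  if (!requireNamespace("randomForest", quietly = TRUE)) {
    message("Package 'randomForest' is not installed; skipping the fit.")
    return(NULL)
  }
  predictors <- df[setdiff(names(df), response)]
  fit <- randomForest::randomForest(x = predictors, y = df[[response]],
                                    ntree = 200)
  imp <- randomForest::importance(fit, type = 2)  # IncNodePurity
  sort(imp[, 1], decreasing = TRUE)
}

importance_ranking <- rank_importance(toy)
print(importance_ranking)
```

On the real data, the same helper would be called with the full wine data frame in place of toy.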

We found that alcohol is the most important feature describing wine quality. This plot shows that the alcohol distribution differs noticeably between quality levels, which justifies selecting alcohol as the most important feature. We can use a t-test to check whether the distributions are statistically different.
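The t-test mentioned above could be run with base R's t.test(), which defaults to Welch's unequal-variance test. The sketch below only demonstrates the call on the first ten rows from the str() output; that is far too few observations for a meaningful result. The helper compare_alcohol is a hypothetical name.

```r
# First ten quality and alcohol values, copied from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Hypothetical helper: Welch two-sample t-test of alcohol between two
# quality levels.
compare_alcohol <- function(alcohol, quality, level_a, level_b) {
  t.test(alcohol[quality == level_a], alcohol[quality == level_b])
}

res <- compare_alcohol(alcohol, quality, 5, 7)
print(res$estimate)  # group means
print(res$p.value)   # two-sided p-value
```

On the full data set, the same call can be repeated for each pair of quality levels, keeping in mind that multiple comparisons call for a p-value adjustment.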

Most of the alcohol observations belong to wines of quality 5, 6 and 7. As wine quality increases, the center of the distribution shifts to the right.

Next we look at how sulphates are distributed across wine quality. The left plot shows the distribution of sulphates, which contains many outliers. To see the patterns more clearly we removed the outliers from the data; the right plot shows the resulting distribution. The pattern shows that sulphates also distinguish wine quality, as the mean values look significantly different.

Most of the sulphate observations are again for wines of quality 5, 6 and 7. However, with the outliers included, the mean sulphate level does not differ much between quality levels, because most of the distributions are right-skewed with some outliers. After removing the outliers, the distributions become more separable.

The next important feature is total.sulfur.dioxide. It has outliers for wine quality 6 and 7, so we removed them to see the distribution without outliers. The distributions overlap but have different means. We can perform a t-test to check whether the means are significantly different. The distributions for wine quality 7 and 8 look similar and are centered at the same mean.

Total sulfur dioxide is an important feature of interest, and its observations mostly fall in wine quality 5, 6 and 7. Within each quality level the distribution is right-skewed and contains a few outliers. This suggests removing these outliers and transforming the distribution for each quality level.

The distributions of volatile.acidity show less variance and shift downwards as quality increases, so the distributions for wine quality 3 and 4 differ from those for 6, 7 and 8. The right plot shows the pattern after removing outliers. From this pattern, volatile.acidity appears to distinguish clearly between wines of quality 5 or less and wines of quality above 5. This decrease with increasing quality runs opposite to the pattern seen for sulphates.

The boxplot reveals an interesting pattern in volatile acidity: its distributions shift downwards as wine quality increases. This pattern is evident in the distribution plot as well. The distributions for wine quality 5, 6 and 7 are almost normal with a few outliers. Low-quality wine contains more volatile acidity, and the amount decreases as wine quality increases.

The plot clearly shows that the distributions of density overlap across wine quality levels, so it is difficult to distinguish wine quality using density.

There are very few observations for wines of quality 3 and 8. The distributions for wine quality 5 and 6 are almost identical, with equal means. Low-quality wines have higher volatile acidity, and average volatile acidity falls as wine quality increases.

The pH distribution for lower-quality wine looks different from that for higher-quality wine, but the distributions overlap heavily for mid-quality wine. The patterns look almost the same even after removing outliers.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

We observed interesting relationships between the distribution of alcohol and wine quality: the alcohol distribution is noticeably different for each quality level. This relationship is strongest for alcohol, followed by sulphates and the other features, which agrees with our initial findings about the important variables. For less important features, the distribution does not differ much between adjacent quality levels.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes. The correlation plots show that chlorides, sulphates, density, fixed.acidity and citric acid are positively correlated with each other. Because we ordered the correlation matrix by hierarchical clustering, these features are grouped together. pH has a strong negative correlation with fixed.acidity and citric acid, and a weaker negative correlation with chlorides, sulphates and density. pH is weakly positively correlated with alcohol and volatile.acidity. These patterns are expected, since a solution becomes more acidic as its pH decreases.

What was the strongest relationship you found?

We observed a strong relationship between density and alcohol. Alcohol is less dense than water, so a denser wine would be expected to be less alcoholic, and vice versa. Since alcohol is the most important feature for wine quality, this relationship is important to note. Investigating further, we observed that density is strongly correlated with fixed acidity. It would be interesting to transform fixed acidity into groups to see how it affects wine quality.
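To put the strength of this relationship in numbers, the sketch below takes the density-alcohol coefficient from the correlation matrix above (-0.49617977) and the sample size (1599), and derives the share of variance explained and the usual t statistic for a Pearson correlation. The variable names are illustrative.

```r
# Correlation between density and alcohol, from the matrix above.
r <- -0.49617977
# Number of observations in the data set.
n <- 1599

# Share of variance in one variable linearly explained by the other.
r_squared <- r^2

# t statistic for testing whether the true correlation is zero,
# with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r_squared)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(r_squared, 3))
print(round(t_stat, 2))
print(p_value)
```

So density accounts for roughly a quarter of the variance in alcohol: a clear relationship, though far from a deterministic one.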

Multivariate Plots Section

From the random forest fit, we found that alcohol is the most important feature in determining wine quality, followed by sulphates. The correlation matrix shows that sulphates have some positive correlation with chlorides. In this plot, dot size reflects the chloride level: dots are larger at higher sulphate values and smaller where sulphate levels are low.

Many drinks are made from alcohol and citrus juice, which contains a carboxylic acid; citrus juice is a source of citric acid. So we further investigated the distribution of sulphates and alcohol at different citric acid levels for each quality. An interesting pattern appears when wine quality is 7 and the citric acid level is 0-0.2: here the relationship between alcohol and sulphates is positive, while in the other cases it is negative.

The correlation between alcohol and total sulfur dioxide is negative for low-quality wines and becomes more positive as wine quality increases. Free and total sulfur dioxide are positively correlated: larger dots appear at higher values of total sulfur dioxide, and vice versa.

An interesting pattern appears when we separate the relationship between total sulfur dioxide and alcohol by citric acid level: the distribution is dense and aggregated. Data points that initially appeared as outliers are clearly separated out.

Volatile acidity is higher relative to alcohol for low-quality wines than for high-quality wines. As wine quality increases the distribution shifts downwards, which suggests that high-quality wines tend to have low volatile acidity. We also observed that low citric acid values appear as outliers for all qualities.

Most of the data is distributed among wines of quality 5, 6 and 7. In each sub-plot the distributions are compact, without outliers. Citric acid plays an important role with these features; as we saw, citric acid and volatile acidity are negatively correlated with each other.

Residual sugar is high when alcohol is low and density is high. It decreases as density and alcohol level increase.

For each citric acid level, the trend between alcohol and density is negatively sloped, with most of the data points falling in wines of quality 5, 6 and 7.

Fixed acidity is negatively correlated with pH, and this trend is clearly visible in this plot. Larger dots appear at low values of fixed acidity; for higher values of fixed acidity the pH values are low. Alcohol and fixed acidity are negatively correlated across all wine qualities.

The overall trend between fixed acidity and alcohol is strongly negative for almost all wine qualities. However, within the different levels of citric acid the relationship is only slightly negative. We can conclude that citric acid clearly separates the distributions of fixed acidity and alcohol into groups.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Yes, we observed strong relationships between our feature of interest, quality, and the other features of interest: alcohol, sulphates, total sulfur dioxide, volatile acidity, density and fixed acidity. The alcohol distribution is the most distinguishable across wine qualities. High-quality wines have more alcohol and sulphates. Total sulfur dioxide appears to be higher in wines of quality 5 than in other qualities. Density and volatile acidity appear to be lower in high-quality wines. pH does not seem to make much difference to wine quality.

Were there any interesting or surprising interactions between features?

Yes. Although the relationship between sulphates and alcohol was negative in most cases, the chloride level appears to be low at low sulphate values and high at higher values. Also, as alcohol level and wine quality increase, the mean of the sulphate distribution increases while chloride levels stay low. Sulphates and alcohol are positively correlated only for high-quality wines with citric acid in the range 0-0.2. Alcohol is negatively correlated with total sulfur dioxide for low-quality wines and positively correlated for higher-quality wines.
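Sign changes like these can be checked by computing the correlation separately within each quality level. The sketch below uses a hypothetical helper, cor_by_group, on the first ten rows from the str() output. Groups with fewer than three observations return NA, because a correlation from one or two points is not informative; that threshold is an assumption.

```r
# First ten rows of the relevant variables, copied from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
total.sulfur.dioxide <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Hypothetical helper: Pearson correlation of x and y within each group,
# returning NA for groups with fewer than min_n observations.
cor_by_group <- function(x, y, group, min_n = 3) {
  vapply(split(seq_along(group), group), function(idx) {
    if (length(idx) < min_n) return(NA_real_)
    cor(x[idx], y[idx])
  }, numeric(1))
}

cor_by_quality <- cor_by_group(alcohol, total.sulfur.dioxide, quality)
print(round(cor_by_quality, 2))
```

Applied to the full data set, the result is one coefficient per quality level, which makes a change of sign between low and high quality easy to spot.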

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, we created a random forest model using the dataset, with quality as the dependent variable and all other features as predictors. The random forest model suggests that alcohol is the most important feature in determining wine quality, followed by sulphates.

Final Plots and Summary

Plot One

Description One

This plot shows that values cluster at multiple levels of citric acid. The first mode forms in the range 0-0.20, followed by 0.2-0.6 and above 0.6, so we can divide this distribution into three groups. Citric acid revealed an interesting pattern in the relationships of volatile and fixed acidity with alcohol across wine qualities.

Plot Two

Description Two

Alcohol is right-skewed when plotted overall. However, plotting it for each quality level reveals interesting patterns: as wine quality increases, mainly from 5 upwards, alcohol increases significantly. This suggests that alcohol is the most important candidate for measuring wine quality.

Plot Three

Description Three

For low-quality wines with low alcohol, the volatile acidity level is higher than for high-quality wines. From quality 5 upwards, the relationship between alcohol and sulphates becomes more positive. The level of citric acid also appears to increase with wine quality.


Reflection

In this dataset we explored each feature individually, looking at how each is distributed and at its potential outliers. For many features, such as sulphates, total sulfur dioxide, pH, fixed acidity, volatile acidity, residual sugar and chlorides, a few values lie far out on the right-skewed tail of the distribution. On further exploration we found that these observations are related to higher or lower quality wines.

We also explored the relationships of features with each other. We conducted principal component analysis on the dataset and found that fixed acidity, citric acid and sulphates are highly associated. Since PCA does not consider the dependent variable when finding relationships, we fitted a random forest model to find the important features that can affect wine quality. We also explored the relationships among these features; for example, we explored how alcohol and sulphates are related for each wine quality and for different levels of citric acid. We observed that the relationship is positive when wine quality is high and the citric acid level is low.

We faced a challenge in deriving new features from the existing features, as it required some domain expertise. We also observed that most of the observations are of wine quality 5, 6 and 7, with very few for 3, 4 and 8. The insufficient observations in these categories can make it harder to find the true distributions and relationships.

In future, we would like to conduct statistical tests to measure the difference in means for each feature across wine qualities. This will help in deciding the most important feature set, which we can use to build a more powerful predictive model. We would also like to measure the quality of fit of the model.